Abstract

The purpose of this document is to report the progress made in the first year of the grant
entitled "Investigation of Neural Network Dynamics" (AFOSR-87-0354). The proposed period of
the work was September 1, 1987 to August 31, 1990. The proposed three-year budget was
$126,200, with a first-year budget of $40,000.

The  grant  was  closed  after  a  single  year  because  the  principal  investigator  moved  from 
the  Applied  Physics  Laboratory,  Johns  Hopkins  University  to  the  Jet  Propulsion  Laboratory, 
California  Institute  of  Technology.  Nevertheless,  many  of  the  initial  objectives  were  met  in  the 
single  year  that  the  grant  was  in  force. 

The  major  result  of  this  investigation  is  a  systematic  approach  for  exploiting  the  dynamics 
of  a  general  class  of  neurodynamical  systems  for  the  purpose  of  neural  computation.  We  have 
interpreted  the  back-propagation  formalism  as  an  adaptive  algorithm  for  a  general  class  of 
dynamical  systems.  The  completely  continuous  formalism  lends  itself  to  implementation  in  analog 
VLSI.  This  can  be  accomplished  without  external  synchronization. 

Four papers were published which acknowledged the grant: three in refereed journals and
one in a refereed conference proceedings. One student was partially supported by the
grant. The work received wide recognition and acceptance from the scientific and technical
community.
Introduction


This  document  reports  on  progress  made  in  the  first  year  of  the  grant  entitled 
"Investigation  of  Neural  Network  Dynamics"  (AFOSR-87-0354).  The  proposed  period  of  the 
work  was  September  1,  1987  to  August  31,  1990.  The  proposed  three  year  budget  was  $126,200 
with  a  first  year  budget  of  $40,000.  The  term  of  the  first  year  grant  was  extended  until 
January  31,  1989. 

2.1 Research Objectives/Statement of Work

To achieve a substantial improvement in the understanding of neural network dynamics, a
program based on numerical simulation and formal analysis was proposed. The investigation was
organized around several broad issues:

1)  What  are  the  formal  relationships  between  the  various  neural  network  models?  Models  were 
expected  to  fall  into  equivalence  classes  which  displayed  qualitatively  similar  behavior.  It  was  of 
interest  to  identify  the  minimal  models  in  each  class  and  to  identify  the  most  general  models  in  each 
class. 

2)  It  was  proposed  to  investigate  the  conjecture  that  the  convergence  of  some  networks  was  an 
emergent  property  —  that  is,  the  convergence  of  the  system  improved  in  the  limit  of  very  many 
processing  units. 

3)  It  was  proposed  to  investigate  the  storage  capacity  of  various  neural  networks.  A  detailed 
analysis  of  the  information  storage  capacity  of  feedforward  networks  had  been  performed  by  Baum 
et  al.  [BMW87].  We  were  interested  in  extending  these  results  to  networks  with  feedback. 

4)  It  was  proposed  to  investigate  how  particular  models  map  onto  particular  machines  and  what 
models  lend  themselves  to  implementation  in  VLSI.  Ideally,  a  model  suitable  for  implementation  in 
VLSI  would  keep  a  maximal  amount  of  silicon  busy.  This  means  that  as  many  units  as  possible 
must  take  an  active  role  in  the  computation.  A  desirable  property  is  for  each  node  to  be  able  to 
process  its  information  asynchronously.  We  were  interested  in  investigating  models  which  possess 
this  property. 
5) It was proposed to investigate the role of characteristic time scales in neural networks. At
least three time scales can be identified: 1) the signal propagation time, $\tau_p$, which in the
brain corresponds to the 1-10 ms it takes signals to propagate across the cortex; 2) the time
scale of input fluctuations, $\tau_f$, since the brain responds to inputs from the external
environment which fluctuate with their own characteristic time scales; 3) the time scale of
microscopic learning, $\tau_l$, on which the strength of a synaptic connection in the brain
changes.

6)  It  was  proposed  to  investigate  how  algorithms  scale  in  the  limit  of  many  processors  (large-N 
limit).  An  important  field  of  research  at  the  present  time  is  the  development  of  faster  learning 
algorithms  which  can  be  scaled  up  to  networks  with  millions  of  modifiable  connections. 

2.2  Personnel 

The investigations were carried out by the following personnel. The principal investigator for
this program was Dr. Fernando J. Pineda. During the first year of the grant he was a member of
the Senior Professional Staff in the Space Department at JHU/APL. Currently he is a member of
the Technical Staff at the Jet Propulsion Laboratory, California Institute of Technology. Dr. Pineda
received his Ph.D. in Theoretical Nuclear Physics from the University of Maryland in December
1986. Part-time support was provided for a very talented undergraduate student from Johns
Hopkins University, Mr. A. David Redish. In addition, the P.I. worked with two students: Mr.
Ben Yuhas of the Department of Computer and Electrical Engineering at Johns Hopkins
University, and Mr. Etienne Deprit of the Naval Research Laboratories. Mr. Deprit published his
class project in the refereed journal Neural Networks [De89].

2.3  Status  of  work 

This  section  addresses  the  progress  made  in  the  first  year.  Most  of  the  progress  was  made  in 
the  area  of  adaptive  algorithms  based  on  the  Recurrent  Backpropagation  (RBP)  algorithm.  The 
progress  made  during  the  first  year  was  reported  in  the  following  three  papers: 

1) Generalization of back-propagation to recurrent neural networks,
F. J. Pineda, Physical Review Letters, 18, pp. 2229-2232 (1987)

2) Generalization of back-propagation to recurrent and higher-order neural networks,
F. J. Pineda, (to appear), Proceedings of IEEE Conference on Neural Information
Processing Systems, (Dana Z. Anderson, ed.), Denver, Colorado, Nov. 8-12 (1987)

3) Dynamics and Architecture in neural computation,
F. J. Pineda, (to appear), Journal of Complexity, Special Issue on Neural Networks,
September (1988)

In addition to the above, some work was reported at the "Neural Networks for Computing"
conference held in Snowbird, Utah, April 6-10, 1988. Finally, after the grant had expired, the
author wrote a mini-review article entitled "Recurrent Backpropagation and the Dynamical Approach
to Neural Computation" [Pi89]. It summarizes the work performed to date and its significance, and
is included in the appendix.

Some  of  the  key  results  of  the  investigation  are  now  summarized.  The  reader  may  refer  to 
the  articles  in  the  appendices  for  detailed  information. 

1) Our investigation into adaptive algorithms was based on a formalism for constructing general
adaptive dynamical systems which obey nonlinear coupled differential equations. This formalism,
denoted recurrent back-propagation (RBP), is a non-algorithmic continuous-time formalism for
adaptive recurrent and nonrecurrent networks in which the physical aspects of the computation are
stressed [Pi87a, Pi87b and Pi88]. The formalism is expressed in the language of differential
equations so that the connection to collective physical systems is more natural. RBP can be put into
an algorithmic form to optimize the performance of the network on digital machines. Nevertheless,
as discussed below, the intent of the formalism is to stay as close to collective physics and
dynamics as possible.

The  class  of  neural  network  models  which  can  be  trained  by  RBP  is  very  general,  but  most 
of  our  work  has  focused  on  a  simple  system  given  by  the  following  differential  equations: 

$$\tau_x \frac{dx_i}{dt} = -x_i + \sum_j w_{ij} f(x_j) + I_i, \qquad i = 1, \dots, n. \tag{1}$$

The vector x represents the state vector of the network, I represents an external input vector, and w
represents a matrix of coupling constants (weights) which give the strengths of the interactions
between the various neurons. The relaxation time scale is $\tau_x$. By hypothesis, the vector-valued
function f(x) is differentiable and chosen so as to give the system appropriate dynamical properties.
For example, biologically motivated choices are the logistic or hyperbolic tangent functions [Co68].
When the matrix w is symmetric this system corresponds to the Hopfield model with graded neurons
[Ho84].
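As a rough illustration of how a system of the form (1) settles onto a fixed point, the following sketch integrates the equations with a forward-Euler step. The function name `relax`, the step size, the tanh nonlinearity and the 3-unit weight matrix are all assumptions made for the example; they are not taken from the report.

```python
import math

def relax(w, ext_input, tau_x=1.0, dt=0.05, steps=4000, f=math.tanh):
    """Forward-Euler integration of tau_x dx_i/dt = -x_i + sum_j w_ij f(x_j) + I_i."""
    n = len(ext_input)
    x = [0.0] * n
    for _ in range(steps):
        fx = [f(v) for v in x]
        dx = [(-x[i] + sum(w[i][j] * fx[j] for j in range(n)) + ext_input[i]) / tau_x
              for i in range(n)]
        x = [x[i] + dt * dx[i] for i in range(n)]
    return x

# Hypothetical small network with weak, asymmetric couplings.
w = [[0.0, 0.3, -0.2],
     [0.1, 0.0, 0.4],
     [-0.3, 0.2, 0.0]]
ext_input = [0.5, 0.0, -0.1]

x_fixed = relax(w, ext_input)

# At a fixed point the right-hand side of (1) vanishes.
residual = max(
    abs(-x_fixed[i] + sum(w[i][j] * math.tanh(x_fixed[j]) for j in range(3)) + ext_input[i])
    for i in range(3)
)
```

With couplings this weak the flow is contracting, so the state converges to an isolated fixed point; stronger or strongly self-exciting weights could instead produce the oscillatory or chaotic behavior mentioned in the text.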

In general, the solutions of equation (1) exhibit oscillations, convergence onto isolated fixed
points, and chaos. For our purposes, convergence onto isolated fixed points is the desired behavior,
because we use the value of the fixed point as the output of the system. When the network is loaded,
the weights are adjusted so that the output of the network is the desired output. Recurrent
back-propagation dynamics is based on gradient descent and exploits two tricks to reduce the amount
of computation. The first trick uses the fact that, for equations of the form (1), the gradient of an
objective function $E(x^0)$ can be written as an outer product, i.e.

$$\nabla_w E = y^0 f(x^0)^T, \tag{2}$$

where $x^0$ is the fixed point of eqn. (1) and where the "error vector" $y^0$ is given by

$$y^0 = (L^T)^{-1} J, \tag{3}$$

where $L^T$ is the transpose of the n x n matrix whose components are

$$L_{ij} = \delta_{ij} - w_{ij} f'(x_j).$$

J is an external error signal which depends on the objective function and on $x^0$. This trick reduces
the computational complexity of the gradient calculation by a factor of n (relative to the $O(n^4)$ cost
of numerical differentiation) because $L^{-1}$ can be calculated by direct matrix inversion in $O(n^3)$
operations and because $x^0$ can be calculated in only $O(n^2)$ operations. Thus the entire calculation
scales like $O(n^3)$ or $O(N^{3/2})$, where N is the number of weights. The second trick exploits the
fact that $y^0$ can be calculated by relaxation; equivalently, it is the (stable) fixed point of the
following coupled set of linear differential equations:

$$\tau_y \frac{dy_i}{dt} = -y_i + f'(x_i) \sum_j w_{ji} y_j + J_i. \tag{4}$$

This equation was derived in [Pi87]. A discrete-time version was derived independently by Almeida
[Al87].
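The two tricks can be checked numerically on a toy problem. The sketch below relaxes (1) to obtain the fixed point, relaxes (4) to obtain the error vector, forms the outer-product gradient (2), and compares it with a central finite-difference gradient. It assumes the convention $J_i = \partial E/\partial x_i$ with a hypothetical objective $E = \frac{1}{2}(x_{out} - t)^2$ on a single output unit, a tanh nonlinearity, unit time constants and forward-Euler integration; all function names and numerical values are illustrative only.

```python
import math

def fprime(v):
    """Derivative of tanh."""
    return 1.0 - math.tanh(v) ** 2

def relax_x(w, inp, dt=0.05, steps=4000):
    """Fixed point of eq. (1) by forward-Euler relaxation (tau_x = 1 assumed)."""
    n = len(inp)
    x = [0.0] * n
    for _ in range(steps):
        fx = [math.tanh(v) for v in x]
        x = [x[i] + dt * (-x[i] + sum(w[i][j] * fx[j] for j in range(n)) + inp[i])
             for i in range(n)]
    return x

def relax_y(w, x, J, dt=0.05, steps=4000):
    """Fixed point of eq. (4); note the transposed weights w[j][i]."""
    n = len(x)
    y = [0.0] * n
    for _ in range(steps):
        y = [y[i] + dt * (-y[i] + fprime(x[i]) * sum(w[j][i] * y[j] for j in range(n)) + J[i])
             for i in range(n)]
    return y

def objective(w, inp, target, out):
    x = relax_x(w, inp)
    return 0.5 * (x[out] - target) ** 2

def rbp_gradient(w, inp, target, out):
    """Outer-product gradient of eq. (2): dE/dw_rs = y_r f(x_s)."""
    x = relax_x(w, inp)
    J = [0.0] * len(x)
    J[out] = x[out] - target          # assumed convention: J_i = dE/dx_i
    y = relax_y(w, x, J)
    return [[y[r] * math.tanh(x[s]) for s in range(len(x))] for r in range(len(x))]

def numerical_gradient(w, inp, target, out, h=1e-5):
    """Central differences: two objective evaluations per weight."""
    n = len(inp)
    g = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for s in range(n):
            wp = [row[:] for row in w]
            wm = [row[:] for row in w]
            wp[r][s] += h
            wm[r][s] -= h
            g[r][s] = (objective(wp, inp, target, out)
                       - objective(wm, inp, target, out)) / (2 * h)
    return g

w = [[0.0, 0.3, -0.2], [0.1, 0.0, 0.4], [-0.3, 0.2, 0.0]]
inp = [0.5, 0.0, -0.1]
g_rbp = rbp_gradient(w, inp, target=0.2, out=2)
g_num = numerical_gradient(w, inp, target=0.2, out=2)
max_diff = max(abs(g_rbp[r][s] - g_num[r][s]) for r in range(3) for s in range(3))
```

The relaxation route needs one forward and one backward relaxation for the whole gradient, whereas the finite-difference route repeats the forward relaxation for every weight.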

2) The recurrent backpropagation formalism provides a single unified approach for feed-forward
type networks and recurrent networks. This approach is very powerful and can be applied to a
variety of dynamical systems. Another consequence of the unified approach is that it is possible to
build hierarchical architectures containing both associative memory and feed-forward components
from homogeneous units. This was demonstrated in [Pi88].

3) The role of characteristic time scales in the adaptive algorithms was investigated. In many
paradigms of neural computation (e.g. conventional backpropagation) characteristic time scales do not
play a role. Therefore it is impossible to gauge how fast a "true" neural machine would perform the
same task. (For this purpose a "true" neural machine is one whose basic functional units implement
the appropriate neurodynamics directly, as typified by the analog VLSI approach taken by Mead
[Me89].) Constraints on various time scales were derived and published [Pi88].
4) We have shown how the algorithm can be implemented as a continuous-time dynamical system.
This is important because it permits the possibility of implementing the dynamical learning algorithms
in analog VLSI without the need for an internal or external clock. Our formalism has allowed us to
estimate the performance of analog VLSI implementations of backpropagation networks. Preliminary
results were presented at the Snowbird conference [Pi88]. The performance of an electronic physical
system was estimated by using electronic time scales for the characteristic time scales. For a simple
example, it was estimated that learning could be accomplished in approximately 10 milliseconds. This
compared very well to a digital simulation that required several minutes to converge.

5) We proposed to investigate how particular models map onto particular machines and what
models lend themselves to implementation in VLSI. Mr. Etienne Deprit, a student taking a course
taught by the P.I., implemented the recurrent backpropagation algorithm on a Connection Machine
[De88]. Mr. Deprit experimented with two different implementations of the algorithm. One
implementation was based on a generalization of the implementation of Rosenberg and Blelloch
[RB88]. This implementation devoted CM processors to connections as well as neurons. The
second implementation was based on a routing algorithm developed by Tomboulian [To86] which
used processors for neurons only. Mr. Deprit concluded that the Tomboulian algorithm spends most
of its time communicating whereas the Rosenberg and Blelloch algorithm spends most of its time
computing. Furthermore, for networks where the fan-in gets very large, the Tomboulian algorithm
required processors with memory size proportional to fan-in. The most effective use is made of
hardware if the connections themselves, rather than the processors, can be made to perform
computation. In other words, synapses must be "smart" memories.

6) We investigated how the gradient calculation algorithm scales with the number of processors.
In gradient descent learning, the computational problem is to optimize an objective function whose
free parameters are the weights. Let the number of weights be denoted by "N" and let the number of
processing units be denoted by "n". Then N is proportional to $n^2$ if the fan-in/fan-out of the units is
proportional to n. In a neural network the evaluation of an objective function requires $O(N)$ or $O(n^2)$
operations. Accordingly, to calculate the gradient of the objective function by numerical
differentiation requires $O(N^2)$ or $O(n^4)$ calculations. For big problems, i.e. problems with many
connections, this becomes intractable very rapidly.
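To make this scaling concrete, consider a purely illustrative fully connected network with $n = 10^3$ units (a size chosen here only for the arithmetic, not taken from the report), ignoring constants of proportionality:

```latex
N \sim n^2 = (10^3)^2 = 10^6, \qquad
\text{one objective evaluation} \sim O(N) = 10^6 \text{ operations}, \qquad
\text{numerical gradient} \sim O(N^2) = O(n^4) = 10^{12} \text{ operations}.
```

Each of the $N$ weights must be perturbed in turn and the objective re-evaluated at $O(N)$ cost, which is where the extra factor of $N$ comes from.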

To relax y (i.e. to integrate eqn. (4) until y reaches steady state) requires $O(n^2)$ operations per
time step. The number of time steps is independent of n. Therefore the calculation of $y^0$ is $O(n^2)$ or
$O(N)$. The method is computationally efficient provided the network is sufficiently large and sparse
and provided that the fixed points are not marginally stable.
Note that the two back-propagation algorithms scale like $O(N)$, but this hides the constants of
proportionality, which for feed-forward networks depend on the number of layers, whereas for
recurrent networks they depend strongly on the eigenvalues of the L
matrix. Indeed, if the fixed points are marginally stable, the number of iterations required to converge
onto $x^0$ and $y^0$ may diverge. These results are summarized in Table 1, which shows the relationship
between the various algorithms.

Numerical algorithm                                          Complexity

Worst case (e.g. numerical differentiation)                  O(N^2)
Matrix inversion (e.g. Gaussian elimination)                 O(N^(3/2))
Matrix inversion by relaxation (e.g. RBP)                    O(N)
Recursion (e.g. classical feed-forward back-propagation)     O(N)

Table 1. Scaling of gradient calculation with the number of connections

The scaling referred to here should not be confused with the number of gradient evaluations
required for convergence to a solution. Indeed, for some problems, e.g. parity, the required number
of gradient evaluations may diverge at critical training set sizes [Te87].

The O(N) scaling of the gradient calculation is arguably the single most important reason that
back-propagation algorithms have made such an impact. The idea of using gradient descent is
certainly not new, but whereas it was previously tractable on small problems only, it is now tractable
on big problems too. It is interesting to observe that a similar situation arose after the development of
the FFT algorithm. The idea of numerical Fourier transforms had been around for a long time before
the FFT, but the FFT caused a computational revolution by reducing the complexity of an n-point
Fourier transform from $O(n^2)$ to $O(n \log n)$.

3. Related funding actions

3.1 AFOSR

A proposal to continue the work in this grant has been submitted to the AFOSR through the
Jet Propulsion Laboratory, California Institute of Technology. The new proposal is entitled
"Adaptive Dynamics for Neural Computation," JPL Task Plan No. 80-3095.


3.2  Recognition  of  work  by  the  community 


Recurrent back-propagation has proven to be a rich and useful computational tool. Qian and
Sejnowski [QS88] have demonstrated that a recurrent back-propagation network can be trained to
calculate stereo disparity. This results in a network similar to that of Marr and Poggio [MP76].
Barhen et al. [BGZ89] have used the method to train networks on inverse kinematics for robotic
applications. The formalism has also been fertile soil for theoretical developments. Pearlmutter
[Pe88] has extended the technique to time-dependent trajectories, while Simard and Ballard [SB88]
have investigated its convergence properties.

The principal investigator has been invited to present talks based on this work at the
University of Chicago (Depts. of Mathematics and Computer Science) and the California Institute of
Technology (Dept. of Electrical Engineering). Prof. Carver Mead at Caltech has expressed an
interest in applying these ideas to the construction of an actual analog VLSI learning chip. Most
recently, the P.I. has been invited to contribute a chapter to a book devoted to the backpropagation
algorithm (edited by D. Rumelhart and E. Chauvin).
An adaptive neural network with asymmetric connections is introduced. This network is related to the
Hopfield network with graded neurons and uses a recurrent generalization of the $\delta$ rule of Rumelhart,
Hinton, and Williams to modify adaptively the synaptic weights. The new network bears a resemblance
to the master/slave network of Lapedes and Farber, but it is architecturally simpler.


PACS  numbers:  87.30.Gy 

The neural network approach is a paradigm for computation in which the traditional paradigm of a
finite-state machine performing sequential instructions in a discrete state space is replaced with
the paradigm of a dynamical system, in a discrete or continuous state space, which evolves under the
control of a certain class of dynamics (neurodynamics). Although a precise definition of
neurodynamics does not exist, it seems safe to characterize it by at least three salient features.
First, the dynamical system has very many degrees of freedom. At the present time, most simulations
of these systems are limited to fewer than $10^5$ neurons. On the other hand, the human brain has at
least $10^{11}$ neurons. The activity level and the time derivative of the activity of the neurons are
the coordinates in the phase space of the system. This phase space plays the role of the state space
in a conventional computing machine. The second feature of neurodynamics is nonlinearity.
Nonlinearity is essential to create a universal computing machine. This follows because a network
composed of linear units can always be reduced to an equivalent single-layer network which performs
the same input/output transformation. But, as pointed out by Minsky and Papert [1], a universal
computing machine cannot be built from a single layer of finite-order neurons. The third feature of
neurodynamics is that it is dissipative. A dissipative system is characterized by the convergence of
the phase-space volume onto a manifold of lower dimensionality as time increases. Systems whose flow
exhibits the property of global asymptotic stability play a particularly important role in
neural-network modeling. Global asymptotic stability implies that the system will ultimately settle
down to a steady state for any choice of initial condition. Systems which minimize an energy
function, such as the Hopfield model, are guaranteed to be globally asymptotically stable.
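The remark that a network of linear units collapses to a single layer can be illustrated with a small numerical sketch. The weight matrices, the input vector and the helper names `matvec` and `matmul` below are arbitrary values and names chosen for the illustration.

```python
def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Hypothetical two-layer linear network: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, -1.0, 2.0]]
x = [3.0, -2.0]

two_layer_output = matvec(W2, matvec(W1, x))
single_layer_output = matvec(matmul(W2, W1), x)   # one layer with weights W2 W1
```

Both computations give the same output because composing linear maps is itself a linear map; inserting a nonlinearity between the layers removes this equivalence.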

The identification of stable fixed points with computational objects, e.g., memories, is one of the
fundamental ideas of the paradigm. To implement this idea it is necessary to control the locations
of the fixed points of the neural networks. A learning algorithm is a rule or dynamical equation
which changes the locations of fixed points to encode information. One way of doing this is to
minimize, by gradient descent, some function of the system parameters. This general approach is
reviewed by Amari [2] and forms the basis of many learning algorithms. The algorithm described here
is a specific case of this general approach.

The dynamics of the network considered in this Letter is based on the following system of coupled
differential equations:

$$\frac{dx_i}{dt} = -\alpha x_i + \beta f_i\Big(\sum_j w_{ij} x_j\Big) + I_i, \tag{1}$$

where $x_i$ represents the activity of the ith neuron, where the matrix element $w_{ij}$ denotes the
connection strength, or coupling, from the jth to the ith neuron, and where $\alpha$ and $\beta$ are
conveniently chosen positive constants. The functions $f_i$ may have different forms for various
populations of neurons. A commonly used form is the logistic function.

The constant $I_i$ represents an external input bias which may be included inside or outside
$f_i(\xi)$. I chose the latter case arbitrarily. The fixed points of (1), which I denote as $x^0$,
are solutions of the nonlinear algebraic equations

$$\alpha x_i^0 = \beta f_i\Big(\sum_j w_{ij} x_j^0\Big) + I_i \tag{2}$$

and are implicit functions of the weight matrix w and initial state x'.

Suppose that w is lower triangular. Then it is clear that Eq. (2) can be solved recursively since to
calculate $x_i$ one needs only $x_1, \dots, x_{i-1}$. Thus, when the units are properly labeled, this
is just the forward propagation which occurs in the widely used feedforward network of Rumelhart,
Hinton, and Williams [3]. I conclude that the feedforward network simply provides a direct method of
calculating the fixed points of (1) when w is lower triangular.
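The recursive solution can be sketched in a few lines. The example below assumes a strictly lower-triangular w (zero diagonal, so that each unit depends only on earlier units), a logistic nonlinearity, and $\alpha = \beta = 1$; the function name `forward_fixed_point` and the numerical values are hypothetical.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward_fixed_point(w, inp, alpha=1.0, beta=1.0):
    """Solve alpha*x_i = beta*f(sum_j w_ij x_j) + I_i in one pass over the units.

    Assumes w is strictly lower triangular, so unit i needs only x_0 .. x_{i-1}.
    """
    n = len(inp)
    x = [0.0] * n
    for i in range(n):
        u = sum(w[i][j] * x[j] for j in range(i))   # contributions from earlier units only
        x[i] = (beta * logistic(u) + inp[i]) / alpha
    return x

# Hypothetical 4-unit feedforward network written as a strictly lower-triangular matrix.
w = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [1.5, -2.0, 0.0, 0.0],
     [0.5, 0.5, 3.0, 0.0]]
inp = [1.0, 0.0, 0.0, 0.0]

x0 = forward_fixed_point(w, inp)

# Check that x0 satisfies the full fixed-point equation (2) for every unit.
residual = max(
    abs(x0[i] - logistic(sum(w[i][j] * x0[j] for j in range(4))) - inp[i])
    for i in range(4)
)
```

A single sweep suffices here; no relaxation is needed because nothing feeds back to earlier units.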

The $\delta$ rule is a learning rule for feedforward networks. Strictly speaking, it is restricted to
feedforward networks only. Nevertheless it has been applied to recurrent networks by taking
advantage of the fact that for every recurrent network there exists an equivalent feedforward
network (for a finite time). The cost for this strategy is the manyfold duplication of the hardware
for the feedforward version of the recurrent network [3]. The algorithm presented in this paper makes
unnecessary the artifice of unfolding a recurrent network into a feedforward network.

A necessary condition for the learning algorithm discussed in this Letter to exist is that system (1)
reach steady state (I will not discuss limit cycles here). Except for some theorems concerning
collective quantities [4], little is known about the stability of system (1) for arbitrary w.
However, there are special cases for which (1) can be proved to be globally asymptotically stable.
The set of equations (1) is stable if w is symmetric because (1) can be transformed into the
equations studied by Hopfield [5] under the coordinate transformation

$$u_i = \sum_k w_{ik} x_k.$$

Hopfield's equations are globally asymptotically stable if w is symmetric and has zeros along the
diagonal. Stability in this case is proved because there exists a Liapunov function. A general
theorem concerning stability of networks with symmetric weights is given by Cohen and Grossberg [6].
The set of equations (1) is also globally asymptotically stable if w is lower triangular because in
such a case the network is a pure feedforward network. In other words, the nth unit can only receive
input from the mth unit if n > m. The stability of the feedforward case follows from a recursive
argument which goes as follows. Suppose that the activations $x_i$ (where $i = 1, \dots, m$) are
constant. Then from the feedforward constraint the nth unit (where n = m + 1) receives only constant
input. With constant input Eqs. (1) converge exponentially to a constant value, and hence $x_{m+1}$
becomes constant. Thus it is clear that if the inputs are constant, the activation of the entire
network will ultimately become constant. Equations (1) are also stable in the limit of infinite w,
since if w is infinite the function $f(u_i)$ becomes constant and the solutions to (1) simply decay
exponentially to constants.

Numerical simulations conducted by this author strongly suggest that in practice the system is stable
for most w and initial x. Oscillatory solutions can occur when there exists substantial
self-excitation. It shall be assumed, for the purpose of deriving the back-propagation equations,
that the system ultimately settles down to a stable state. With this caveat in mind I present the
recurrent back-propagation (RBP) algorithm.

Consider a system of neurons, or units, whose dynamics is determined by Eqs. (1). Of all the units in
the network we will arbitrarily define some subset of them, A, as input units and some other subset
of them, $\Omega$, as output units. Units which are members of neither A nor $\Omega$ are denoted
hidden units. A unit may be simultaneously an input unit and an output unit. If a unit is an input
unit, the corresponding component of the vector I is nonzero and represents an external input to the
system, i.e.,

$$I_i = \begin{cases} \xi_i, & \text{if } i \in A, \\ 0, & \text{otherwise,} \end{cases}$$

where $\xi_i$ is an external input.

Our goal will be to find a local algorithm which adjusts w so that a given fixed initial state x' and
a given set of input values $\xi_i$ result in a fixed point, $x^0$, whose components along the output
units have a desired set of values, $\tau_j$ (where $j \in \Omega$). This will be accomplished by our
minimizing a function which measures the error between the desired fixed point and the actual fixed
point. Consider the positive definite function

$$E(x^0) = \frac{1}{2} \sum_i J_i^2,$$

where

$$J_i = \begin{cases} \tau_i - x_i^0, & \text{if } i \in \Omega, \\ 0, & \text{otherwise.} \end{cases}$$

It is an implicit function of the weight matrix w because the fixed point $x^0$ is implicitly
dependent on the weight matrix. $E(x^0)$ has a family of minima which exist on the hyperplanes which
satisfy $x_j^0 = \tau_j$, where $j \in \Omega$.

A formal learning algorithm consists of an algorithm which drives the fixed point towards one of
these hyperplanes. Dynamically, this is accomplished by our letting the system evolve in the weight
space along trajectories which are antiparallel to the gradient $\partial E/\partial w_{ij}$. In other
words,

$$\frac{dw_{ij}}{dt} = -\eta \frac{\partial E}{\partial w_{ij}}, \tag{3}$$

where $\eta$ is a numerical constant which defines the (slow) time scale on which w changes. $\eta$
must be small so that x is always essentially at steady state (i.e., $x(t) \approx x^0$). On
performing the differentiations in (3) one immediately obtains

$$\frac{dw_{rs}}{dt} = \eta \sum_k J_k \frac{\partial x_k^0}{\partial w_{rs}}. \tag{4}$$

The derivative of $x^0$ with respect to $w_{rs}$ is obtained by our differentiating both sides of (2)
with respect to $w_{rs}$ and solving for the derivatives. The result is

$$\frac{\partial x_k^0}{\partial w_{rs}} = \beta (L^{-1})_{kr} f_r'(u_r) x_s^0, \tag{5}$$

where the matrix L is given by

$$L_{ij} = \alpha \delta_{ij} - \beta f_i'(u_i) w_{ij},$$

and where $\delta_{ij}$ is the Kronecker $\delta$ symbol. On substituting (5) into (4) one immediately
obtains

$$\frac{dw_{rs}}{dt} = \eta y_r x_s^0, \tag{6}$$

where

$$y_r = \beta f_r'(u_r) \sum_k J_k (L^{-1})_{kr}. \tag{7}$$
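The intermediate step between (2) and (5) can be spelled out as follows, writing $u_k = \sum_j w_{kj} x_j^0$ for the net input at the fixed point. This is a sketch of the algebra implied by the text rather than a reproduction of the author's own working:

```latex
\alpha \frac{\partial x_k^0}{\partial w_{rs}}
  = \beta f_k'(u_k) \Big[ \delta_{kr} x_s^0 + \sum_j w_{kj} \frac{\partial x_j^0}{\partial w_{rs}} \Big]
\;\Longrightarrow\;
\sum_j \underbrace{\big[ \alpha \delta_{kj} - \beta f_k'(u_k) w_{kj} \big]}_{L_{kj}}
  \frac{\partial x_j^0}{\partial w_{rs}} = \beta f_k'(u_k)\, \delta_{kr}\, x_s^0
\;\Longrightarrow\;
\frac{\partial x_k^0}{\partial w_{rs}} = \beta (L^{-1})_{kr} f_r'(u_r)\, x_s^0 .
```

Inserting the last expression into (4) gives $dw_{rs}/dt = \eta \big[ \beta f_r'(u_r) \sum_k J_k (L^{-1})_{kr} \big] x_s^0$, and the bracketed factor is the error signal $y_r$ of (7).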
Equations (6) and (7) specify a formal learning rule. Equations (7) require a matrix inversion to
calculate the error signals, $y_r$. Direct matrix inversions are necessarily nonlocal calculations
and therefore this learning algorithm is not suitable for implementation as a neural network. A local
method for the calculation of $y_r$ is obtained by the introduction of an associated dynamical
system. Consider the vector z whose components are defined in terms of the components of y according
to

$$y_r = \beta f_r'(u_r) z_r. \tag{8}$$

Equations (7) and (8) imply that $z_r$ satisfies

$$\sum_r L_{ri} z_r = J_i. \tag{9}$$

Now observe that the solutions of Eqs. (9) are the steady-state solutions of

$$\frac{dz_i}{dt} = -\sum_r L_{ri} z_r + J_i.$$

In terms of the explicit variables in the problem, these equations are

$$\frac{dz_i}{dt} = -\alpha z_i + \beta \sum_r f_r'(u_r) w_{ri} z_r + J_i. \tag{10}$$

This leads to a learning rule of the form

$$\frac{dw_{rs}}{dt} = \eta \beta f_r'(u_r) z_r^0 x_s^0. \tag{11}$$

Equations (1), (10), and (11) completely specify the
dynamics for an adaptive neural network, provided that
(1) and (10) are convergent. It is known that the convergence
of (1) is a sufficient condition for the convergence
of (10) [7]. This follows from the observation that
the back-propagation network is obtained from the
forward-propagation network (linearized about a fixed
point) and that a linear network is stable in both directions
if it is stable in either direction. It is quite easy for
one to obtain the $\delta$ rule from the RBP algorithm by expressing
Eqs. (1), (10), and (11) as difference equations
with $\Delta t = 1$ and with w lower triangular.
I have conducted preliminary numerical experiments
with exclusive OR (XOR) networks to verify the
correctness of the algorithm. These were performed by
my approximating the differential equations with first-order
finite-difference equations and requiring that Eqs.
(1) and (10) converge before taking an integration step
in Eq. (11). The XOR network is shown in Fig. 1. Each
input unit receives one digit of a two-digit binary number.
The target for the output unit is 1 if the number
of 1's in the input is odd and 0 otherwise. Unit 5 is a
threshold unit, i.e., it biases the total input to units 3 and
4 so as to provide a threshold which must be exceeded if
these units are to turn on. Unit 5 feeds back on itself so
as to stay turned on always. The feedforward exclusive
OR network used by Rumelhart, Hinton, and Williams
is completely equivalent to this network if the backward
connection from unit 4 to unit 3 is omitted and if the
feedback loop in unit 5 has an infinite positive magnitude.
In practice I made the magnitude of the loop
merely large and was able to reproduce the behavior of
the Rumelhart network.

FIG. 1. XOR network with recurrent connections.

The network with the backward connection performed
only modestly faster than the network without this connection [8].
The main difference in the networks was in the
distribution of final weights. Both networks had similar
attractors which could be characterized by the final root
mean square weight per connection ($w_{rms}$). The attractor
which I denote by A had $w_{rms} = 3.4$, while the attractor
denoted by B had $w_{rms} = 10.7$. The network without
the backward connection converged onto attractors A
and B in approximately 85% and 15% of the trials, respectively,
whereas the network with the backward connection
converged onto attractors A and B in approximately
52% and 48% of the trials, respectively. Only in one trial
out of 480 did the recurrent network fail to converge
onto a global minimum. Each pattern was presented to
the recurrent network approximately 200 times. The
final solutions were insensitive to the initial value of x,
which indicates that the attractors of Eqs. (1) have large
basins of convergence.

It is worthwhile to compare the RBP network with the
master/slave network of Lapedes and Farber [9]. The slave
network corresponds to my forward-propagation network.
If we suppose it has N nodes then the master network
determines the weights of the slave network by integrating
$N^2$ equations, each of which has a form similar
to Eqs. (1), but with slave weight matrix elements as
dynamical variables and a rank-4 matrix as the master's
weight matrix. The weight matrix of the master network
has a simple symmetric form with at most $N(N+1)/2$
nonvanishing independent components. These components
require additional storage beyond the $N^2$ components
of the slave's weight matrix. The RBP network,
on the other hand, requires the integration of $N^2 + 2N$
equations and no additional storage. $2N$ of these equations
correspond to Eqs. (1) and (10). The remaining
$N^2$ equations have a simple outer product form [cf. Eq.
(11)] and are quite trivial to implement. The conclusion
is that the RBP network is an architecturally simpler
network than the master/slave network and requires less
memory.
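The counts quoted in this comparison are easy to tabulate. The sketch below simply evaluates the expressions given in the text for a few network sizes; the function names are hypothetical and nothing beyond the stated formulas is assumed.

```python
def rbp_cost(n):
    """Equations integrated by an N-node RBP network and its extra storage."""
    return {"equations": n * n + 2 * n,   # N^2 weight equations plus 2N for Eqs. (1) and (10)
            "extra_storage": 0}

def master_slave_cost(n):
    """Equations integrated by the master and the bound on its extra storage."""
    return {"equations": n * n,                     # one per slave weight
            "extra_storage_max": n * (n + 1) // 2}  # independent master components

for n in (5, 50, 500):
    print(n, rbp_cost(n), master_slave_cost(n))
```

For large N the two schemes integrate nearly the same number of equations, so the difference lies mainly in the additional storage of up to N(N+1)/2 components.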
The master/slave network directly minimizes the average
of E over all input/output associations. This average
is denoted by $\langle E \rangle$. The master equations are guaranteed
to converge to at least a local minimum of $\langle E \rangle$ because
$\langle E \rangle$ is a Liapunov function for the equations [10]. On the
other hand, E is a Liapunov function for Eq. (3) of the
RBP network only in the case of a single input/output
association. For multiple associations the RBP network
is guaranteed to converge only in a probabilistic sense
and under certain technical conditions. It was noted by
Amari [2] that gradient-descent algorithms, such as RBP,
converge to a minimum point of $\langle E \rangle$ to within a small
fluctuating term provided that the input/output sequence
is an ergodic random sequence and provided that $\langle E \rangle$ has
a unique minimum. Experimentally it is found that
RBP, like standard back-propagation, converges robustly
albeit after very many iterations. A detailed computational
comparison of RBP and master/slave has yet to be
performed.

The RBP algorithm is better suited for hardware implementation
than the $\delta$ rule for two reasons. First, the
algorithm is expressed completely in differential equations
and therefore can be implemented in analog very
large-scale integration. This eliminates the timing and
synchronization problems which appear in digital implementations
of the standard $\delta$ rule. Second, the RBP
algorithm vectorizes naturally. This is because the units
are homogeneous, i.e., the input, hidden, and output
units all obey the same differential (difference) equations;
only the components of the constant vectors I and
J serve to distinguish the roles of the units.

The author wishes to acknowledge very fruitful discussions
with Robert Jenkins and Ben Yuhas. Liam Healy
also contributed in the early discussions. This work was
supported in part by Grant No. AFOSR-87-0354 from
where the sum is over all input/output associations. From (18) it follows that the
gradient for the total error E is simply the sum of the gradients for each association, hence the
corresponding gradient descent equation has the form

$$ \frac{dw_{ij}}{dt} = \eta \sum_\alpha y_i^0[\alpha] \, x_j^0[\alpha]. \qquad (19) $$

over networks with first order units alone. A detailed discussion of the
backpropagation formalism applied to higher order networks is beyond the scope of
this paper. Instead, the adaptive equations for a network with purely n-th order units
will be presented as an example of the formalism. To this end consider a dynamical
system of the form
where $\delta_{ij}$ are the elements of the identity matrix. One can simplify the
right hand side of Eq. (A.1) by substituting Eq. (A.2) into Eq. (A.1) and
performing the summation over j. Also the left hand side can be expressed

where $b_k = g_k'(u_k) J_k$. Now, from (B.1) it is clear that the local stability of
the forward equations depends on the eigenvalues of the matrix L and
from Eq. (B.2) it is clear that the local stability of backward propagation
Abstract

"Back-propagation" is the name given to a family of numerical techniques and adaptive models
which have had a significant impact on neural computation and optimization. The classical
numerical algorithm applies to discrete feedforward networks only. The extension of these ideas to
recurrent networks leads naturally to a continuous-time formalism which may map onto collective
physical systems.

The  characteristic  features  of  the  formalism  are  presented  without  going  too  deeply  into  obscure 
mathematical  details.  The  distinctions  between  the  physical  approach  and  the  algorithmic  approach 
are  emphasized. 

Recent  developments  in  learning  time-dependent  states  are  discussed  and  finally,  physical  and 
biological  plausibility  concerns  are  discussed. 




Introduction 

The problem of loading neural networks with a nonlinear map can be likened to the problem
of finding the parameters in a multidimensional nonlinear curve fit. The traditional way of
accomplishing this is to minimize a measure of the error between the actual output and the "target"
output. Many useful techniques exist, but the most common methods are those which make use
of gradient information. In general, if there are N free parameters in the objective function, the
number of operations required to calculate the gradient numerically is at best proportional to $N^2$.
Neural networks are special because their mathematical form permits two tricks (to be discussed
below) which reduce the complexity of the gradient calculation. When these two tricks are
implemented, the gradient calculation scales linearly with the number of parameters (weights),
rather than quadratically. The resulting algorithm is known as a back-propagation algorithm.

Classical back-propagation was introduced to the neural network community by Rumelhart,
Hinton and Williams (1986). Essentially the same algorithm was developed independently by
Werbos (1974) and Parker (1982) in different contexts. Le Cun (1988) has provided a brief
overview of back-propagation pre-history and stresses that the independent discovery of the
technique and its interpretation in the context of connectionist systems is a recent and important
development. He points out that within the framework of optimal control the essential features of
the algorithm were known as early as 1969 (Bryson and Ho, 1969).

In  this  paper,  the  term  "back-propagation"  will  be  used  generically  to  refer  to  any  algorithm  or 
dynamical  system  which  calculates  the  gradient  by  exploiting  the  two  tricks.  Furthermore,  since 
one  can  write  a  back-propagation  routine  for  evaluating  the  gradient  and  then  use  this  routine  in  any 
prepackaged  numerical  optimization  package,  it  is  reasonable  to  take  the  position  that  the  term 
"back-propagation"  should  be  attached  to  the  way  the  gradient  is  calculated  rather  than  to  the 
particular  algorithm  for  using  the  gradient  (e.g.  conjugate  gradient,  line  search,  etc.). 

If  neural  networks  were  merely  clever  numerical  algorithms  it  would  be  difficult  to  completely 
account  for  the  frenzy  now  associated  with  the  field.  To  my  mind,  much  of  the  excitement  is  due  to 
the  work  of  Hopfield  (1982)  who  made  explicit  the  profound  relationship  between  information 
storage  and  dynamically  stable  configurations  of  collective  physical  systems.  Hopfield  nets  consist 
of  interacting  spins  which  together  form  a  system  known  as  a  spin  glass.  Spin  glasses  are  the 
classic example of a collective physical system. The relevant physical property of spin glasses
which makes them useful for computation is that the collective interactions between all the spins can
result  in  stable  patterns  which  can  be  identified  with  stored  memories.  Although  it  may  not  be 
particularly  useful  for  practical  computing,  the  Hopfield  net  serves  as  an  explicit  example  of  the 
principle  of  collective  computation.  Digital  computers,  on  the  other  hand,  can  compute  because  they 
are physical realizations of finite state machines. In digital computers collective dynamics does not
play  a  role  at  the  algorithm  level,  although  it  certainly  plays  a  role  at  the  implementation  level  since 
the  physics  of  transistors  is  collective  physics.  Collective  computation  is  the  idea  that  collective 
dynamics  plays  an  important  role  at  the  algorithm  level  as  well  as  at  the  implementation  level.  The 
observation  that  collective  dynamics  can  play  a  role  at  both  levels  suggests  that  an  efficient 
approach  would  be  to  use  the  same  collective  dynamics  at  both  levels.  This  is  what  one  might  call 
a  physical  approach  to  computation.  Therefore,  rather  than  have  machine  independent  algorithms, 
one  would  have  just  the  opposite  extreme,  in  which  the  implementation  medium  would  necessarily 
influence  the  design  of  algorithms.  The  physical  approach  constrains  neural  network  models  to  be 
plausible  as  collective  physical  dynamical  systems.  The  resulting  "dynamical  algorithms"  could 
then fully exploit the collective behavior of physical hardware.

Recurrent back-propagation (RBP) is a non-algorithmic continuous-time formalism for
adaptive  recurrent  and  nonrecurrent  networks  in  which  the  physical  aspects  of  the  computation  are 
stressed  (Pineda,  1987a,  1987b,  1988).  The  formalism  is  expressed  in  the  language  of  differential 
equations  so  that  the  connection  to  collective  physical  systems  is  more  natural.  RBP  can  be  put  into 
an algorithmic form to optimize the performance of the network on digital machines; nevertheless,
as  shall  be  discussed  below,  the  intent  of  the  formalism  is  to  stay  as  close  to  collective  physics  and 
dynamics  as  possible. 

RBP  has  proven  to  be  a  rich  and  useful  computational  tool.  Qian  and  Sejnowski  (1988)  have 
demonstrated  that  a  recurrent  back-propagation  network  can  be  trained  to  calculate  stereo  disparity. 
This  results  in  a  network  similar  to  that  of  Marr  and  Poggio  (1976).  Barhen  et  al.  (1989a)  have 
used  the  method  to  train  networks  on  inverse  kinematics  for  robotic  applications.  The  formalism 
has also been fertile soil for theoretical developments. Pearlmutter (1988) has extended the
technique  to  time  dependent  trajectories  while  Simard  and  Ballard  (1988)  have  investigated  its 
convergence  properties. 

Overview  of  the  Formalism 

The  class  of  neural  network  models  which  can  be  trained  by  RBP  is  very  general,  but  it  is 
useful to pick a definite system as an example. Therefore consider a neural network model based on
differential equations of the form

$$ \tau_x \frac{dx_i}{dt} = -x_i + \sum_j w_{ij} f(x_j) + I_i \qquad (1) $$

The vector x represents the state vector of the network, I represents an external input vector and w
represents a matrix of coupling constants (weights) which represent the strengths of the interactions
between the various neurons. The relaxation time scale is $\tau_x$. By hypothesis, the vector valued
function $f(x_j)$ is differentiable and chosen so as to give the system appropriate dynamical properties.
For example, biologically motivated choices are the logistic or hyperbolic tangent functions
(Cowan,  1968).  When  the  matrix  w  is  symmetric  this  system  corresponds  to  the  Hopfield  model 
with  graded  neurons  (1984). 

In general, the solutions of equation (1) exhibit oscillations, convergence onto isolated fixed
points and chaos. For our purposes, convergence onto isolated fixed points is the desired behavior,
because we use the value of the fixed point as the output of the system. When the network is
loaded, the weights are adjusted so that the output of the network is the desired output.

There  are  several  ways  to  guarantee  convergence.  One  way  is  to  impose  structure  on  the 
connectivity of the networks, e.g. a lower triangular weight matrix or a symmetric weight matrix.
Symmetry,  although  mathematically  elegant,  is  quite  stringent  because  it  constrains  microscopic 
connectivity  by  requiring  pairs  of  neurons  to  be  symmetrically  connected.  A  less  stringent 
constraint  is  that  of  Guez  et  al.  (1988)  who  used  Gersgorin's  eigenvalue  localization  theorem  to 
show  that  asymptotic  stability  can  be  obtained  by  imposing  constraints  on  the  row  norm  of  the 
matrix 

$$ L_{ij} = \delta_{ij} - w_{ij} f'(x_j) \qquad (2) $$

where $\delta_{ij}$ are the elements of the identity matrix and $f'(x_j)$ is the derivative of $f(x_j)$.

If the feedforward, symmetry or Guez stability conditions are imposed as initial conditions on
a  network,  gradient  descent  dynamics  will  typically  converge  onto  a  network  which  violates  the 
conditions.  Nevertheless,  this  author  has  never  observed  an  initially  stable  network  becoming 
unstable  while  undergoing  simple  gradient  descent  dynamics.  This  fact  points  out  that  the  stability 
conditions  are  merely  sufficient  conditions  —  they  are  not  necessary.  This  fact  also  motivates  the 
stability  conjecture  upon  which  recurrent  back-propagation  is  based:  that  if  the  initial  network  is 
stable,  then  the  gradient  descent  dynamics  will  not  change  the  stability  of  the  network.  The  reader 
should  note  the  following  caveat:  that  this  conjecture  is  a  statement  about  the  kinds  of  problems 
attacked by this author rather than a statement about rigorous mathematics.

In gradient descent learning, the computational problem is to optimize an objective function
whose free parameters are the weights. Let the number of weights be denoted by "N" and let the
number of processing units be denoted by "n". Then, N is proportional to $n^2$ if the fan-in/fan-out
of the units is proportional to n. In a neural network the evaluation of an objective function requires
$O(N)$ or $O(n^2)$ operations (1). Accordingly, to calculate the gradient of the objective function by
numerical differentiation requires $O(N^2)$ or $O(n^4)$ calculations. For big problems, i.e. problems
with lots of connections, this becomes intractable very rapidly. This notion of big should not be
confused with the difficulty of a problem in the sense of whether a problem is NP complete or not.

(1) The notation $O(n^2)$ means that in the large n limit the number of operations is bounded by $Cn^2$
where C is a constant.
Furthermore, the scaling referred to here should not be confused with the number of gradient
evaluations  required  for  convergence  to  a  solution.  Indeed,  for  some  problems,  e.g.  parity,  the 
required  number  of  gradient  evaluations  may  diverge  at  critical  training  set  sizes  (Tesauro,  1987). 

Now,  as  already  mentioned,  back-propagation  adaptive  dynamics  is  based  on  gradient  descent 
and  exploits  two  tricks  to  reduce  the  amount  of  computation.  The  first  trick  uses  the  fact  that,  for 
equations  of  the  form  (1),  the  gradient  of  an  objective  function  E(x°)  can  be  written  as  an 
outer-product,  i.e. 

$$ \nabla_w E = y^0 f(x^0)^T \qquad (3) $$

where $x^0$ is the fixed point of eqn. (1) and where the "error vector" $y^0$ is given by

$$ y^0 = (L^T)^{-1} J \qquad (4) $$

where $L^T$ is the transpose of the n x n matrix defined in eqn. (2) and J is an external error
signal which depends on the objective function and on $x^0$. This trick reduces the computational
complexity of the gradient calculation by a factor of n because $L^{-1}$ can be calculated by direct matrix
inversion in $O(n^3)$ operations and because $x^0$ can be calculated in only $O(n^2)$ calculations. Thus
the entire calculation scales like $O(n^3)$ or $O(N^{3/2})$. The second trick exploits the fact that $y^0$ can be
calculated by relaxation, or equivalently that it is the (stable) fixed point of the following coupled set of
linear differential equations:

$$ \tau_y \frac{dy_i}{dt} = -y_i + f'(x_i) \sum_j w_{ji} y_j + J_i \qquad (5) $$

A form of this equation was derived by Pineda (1987a). A discrete-time version was derived
independently by Almeida (1987). To relax y (i.e. to integrate eqn. (5) until y reaches steady state)
requires $O(n^2)$ operations per time step. The number of time steps is independent of n. Therefore
the calculation of $y^0$ is $O(n^2)$ or $O(N)$. The method is computationally efficient provided the
network is sufficiently large and sparse and provided that the fixed points are not marginally stable.
These results are summarized in table 1. Note that the two back-propagation algorithms scale like
$O(N)$, but this hides the constants of proportionality, which for feed-forward networks depend on
the number of layers, whereas for recurrent networks the constant of proportionality depends
strongly on the eigenvalues of the L matrix. Indeed, if the fixed points are marginally stable, the
number of iterations required to converge onto $x^0$ and $y^0$ may diverge.
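The two tricks can be checked against brute-force numerical differentiation on a toy network. The sketch below is only illustrative and rests on several assumptions that are not in the text: a three-unit network with hypothetical weights, the logistic function for f, tau_x = tau_y = 1, an objective E = (1/2) sum (x_k - T_k)^2 over a single output unit, and the sign convention J = dE/dx so that eqn. (3) holds literally. The gradient from two relaxations and an outer product is compared with central differences, which need two full relaxations per weight.

```python
import math

def f(v):
    return 1.0 / (1.0 + math.exp(-v))      # logistic, one of the choices named in the text

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)

def relax_x(w, I, dt=0.1, steps=2000):
    """Euler-integrate eqn. (1) with tau_x = 1 until x reaches the fixed point."""
    n = len(I)
    x = [0.0] * n
    for _ in range(steps):
        fx = [f(v) for v in x]
        x = [x[i] + dt * (-x[i] + sum(w[i][j] * fx[j] for j in range(n)) + I[i])
             for i in range(n)]
    return x

def relax_y(w, x, J, dt=0.1, steps=2000):
    """Euler-integrate eqn. (5) with tau_y = 1 until y reaches the fixed point."""
    n = len(J)
    d = [f_prime(v) for v in x]
    y = [0.0] * n
    for _ in range(steps):
        y = [y[i] + dt * (-y[i] + d[i] * sum(w[j][i] * y[j] for j in range(n)) + J[i])
             for i in range(n)]
    return y

def error(w, I, targets):
    """Assumed objective: E = 1/2 sum over output units of (x_k - T_k)^2."""
    x = relax_x(w, I)
    return 0.5 * sum((x[k] - t) ** 2 for k, t in targets.items())

def rbp_gradient(w, I, targets):
    """Gradient from two relaxations and one outer product, eqns. (3) and (5)."""
    x = relax_x(w, I)
    n = len(x)
    J = [x[k] - targets[k] if k in targets else 0.0 for k in range(n)]  # J = dE/dx
    y = relax_y(w, x, J)
    return [[y[i] * f(x[j]) for j in range(n)] for i in range(n)]

def numerical_gradient(w, I, targets, h=1e-5):
    """Central differences: two full relaxations for every weight."""
    n = len(I)
    g = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for s in range(n):
            old = w[r][s]
            w[r][s] = old + h
            e_plus = error(w, I, targets)
            w[r][s] = old - h
            e_minus = error(w, I, targets)
            w[r][s] = old
            g[r][s] = (e_plus - e_minus) / (2.0 * h)
    return g

w = [[0.2, -0.5, 0.3],
     [0.4, 0.1, -0.6],
     [-0.3, 0.7, 0.2]]
I = [0.5, -0.2, 0.1]
targets = {2: 0.8}          # unit 2 is the only output unit

g_rbp = rbp_gradient(w, I, targets)
g_num = numerical_gradient(w, I, targets)
max_diff = max(abs(g_rbp[i][j] - g_num[i][j]) for i in range(3) for j in range(3))
print("largest disagreement:", max_diff)
```

The relaxation route performs two relaxations in total, whereas the finite-difference route performs 2N of them, which is the origin of the difference between O(N) and O(N^2) scaling.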


Table 1. Scaling of various algorithms with the number of connections

numerical algorithm                                        scaling
Worst case (e.g. numerical differentiation)                O(N^2)
matrix inversion (e.g. gaussian elimination)               O(N^{3/2})
matrix inversion by relaxation (e.g. RBP)                  O(N)
recursion (e.g. classical feed-forward back-propagation)   O(N)
For all its faults, back-propagation has permitted optimization to be applied to problems which
were previously considered numerically intractable. The $O(N)$ scaling of the gradient calculation is
arguably the single most important reason that back-propagation algorithms have made such an
impact. The idea of using gradient descent is certainly not new, but whereas it was previously
tractable on small problems only, it is now tractable on big problems too. It is interesting to observe
that a similar situation arose after the development of the FFT algorithm. The idea of numerical
Fourier transforms had been around for a long time before the FFT, but the FFT caused a
computational revolution by reducing the complexity of an n-point Fourier transform from $O(n^2)$
to $O(n \log n)$.

Dynamical  vs  Algorithmic  approaches 

Back-propagation algorithms are usually viewed from an algorithmic viewpoint. For example,
the  gradient  descent  version  of  the  algorithm  is  expressed  in  the  following  pseudo-code: 

while (E > ε)
{
    initialize weight change Δw = 0
    repeat for each pattern
    {
        relax eqn. (1) to obtain x^0
        relax eqn. (5) to obtain y^0
        calculate gradient ∇E = y^0 f(x^0)^T
        accumulate gradients Δw = Δw + ∇E
    }
    update weights w = w + Δw
}
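For readers who wish to experiment, the pseudo-code can be turned into a short runnable sketch. What follows is one possible rendering, not the author's program. It assumes the logistic function for f, a hypothetical three-unit network with unit 2 as output, an error signal J = T - x on the output unit (so that the accumulated outer product is the descent direction and `w = w + Δw` lowers E), and a step size `eta` and an epoch limit `max_epochs` that the pseudo-code does not specify.

```python
import math

def f(v):
    return 1.0 / (1.0 + math.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)

def relax_x(w, I, dt=0.2, steps=200):
    """Relax eqn. (1) (tau_x = 1) to obtain x^0."""
    n = len(I)
    x = [0.0] * n
    for _ in range(steps):
        fx = [f(v) for v in x]
        x = [x[i] + dt * (-x[i] + sum(w[i][j] * fx[j] for j in range(n)) + I[i])
             for i in range(n)]
    return x

def relax_y(w, x, J, dt=0.2, steps=200):
    """Relax eqn. (5) (tau_y = 1) to obtain y^0."""
    n = len(J)
    d = [f_prime(v) for v in x]
    y = [0.0] * n
    for _ in range(steps):
        y = [y[i] + dt * (-y[i] + d[i] * sum(w[j][i] * y[j] for j in range(n)) + J[i])
             for i in range(n)]
    return y

def train(w, patterns, out, eta=0.1, eps=1e-3, max_epochs=100):
    """Batch gradient descent following the pseudo-code; returns E for each epoch."""
    n = len(w)
    history = []
    for _ in range(max_epochs):
        dw = [[0.0] * n for _ in range(n)]      # initialize weight change
        E = 0.0
        for I, T in patterns:                    # repeat for each pattern
            x = relax_x(w, I)
            J = [0.0] * n
            J[out] = T - x[out]                  # error signal on the output unit
            E += 0.5 * J[out] ** 2
            y = relax_y(w, x, J)
            for i in range(n):
                for j in range(n):
                    dw[i][j] += y[i] * f(x[j])   # accumulate outer products
        history.append(E)
        if E <= eps:                             # loop condition: while (E > eps)
            break
        for i in range(n):
            for j in range(n):
                w[i][j] += eta * dw[i][j]        # update weights once per epoch
    return history

w = [[0.0, 0.1, -0.1],
     [0.1, 0.0, 0.1],
     [-0.1, 0.1, 0.0]]
patterns = [([1.0, 0.0, 0.0], 0.9),     # (input vector I, target for unit 2)
            ([0.0, 1.0, 0.0], 0.1)]
history = train(w, patterns, out=2)
print("E at first and last epoch:", history[0], history[-1])
```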

Note  that  all  the  patterns  are  presented  before  a  weight  update.  On  the  other  hand,  a  "dynamical 
algorithm"  can  be  obtained  by  replacing  the  weight  update  step  with  a  differential  equation,  i.e. 

$$ \tau_w \frac{dw_{ij}}{dt} = y_i f(x_j) \qquad (6) $$

and integrating it simultaneously with the forward-propagation and backward-propagation
equations. A constant pattern is presented through the input pattern vector I and the error signal is
presented through the error vector J. The dynamics of this system is capable of learning a single
pattern so long as the relaxation times of the forward and backward propagations ($\tau_x$ and $\tau_y$) are
much shorter than the relaxation time of the weights, $\tau_w$. Since the forward and backward
equations settle rapidly after a presentation, the outer product $y f(x)^T$ is a very good approximation
for the gradient during most of the integration. To learn multiple patterns, the patterns must be
switched slowly compared to the settling time of the forward and backward equations, but rapidly
compared to $\tau_w$, the time scale over which the weights change.
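The single-pattern case can be simulated directly by integrating eqns. (1), (5) and (6) together. The sketch below is an illustration under stated assumptions rather than a reproduction of the author's simulations: a hypothetical three-unit network with unit 2 as output, the logistic function for f, an error signal J = T - x evaluated on the instantaneous state, tau_x = tau_y = 1 with tau_w = 32, and plain Euler steps of size 0.02.

```python
import math

def f(v):
    return 1.0 / (1.0 + math.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)

def simulate(w, I, T, out, tau_x=1.0, tau_y=1.0, tau_w=32.0, dt=0.02, steps=10000):
    """Euler-integrate eqns. (1), (5) and (6) simultaneously; return E at every step."""
    n = len(I)
    x = [0.0] * n
    y = [0.0] * n
    errors = []
    for _ in range(steps):
        fx = [f(v) for v in x]
        d = [f_prime(v) for v in x]
        J = [0.0] * n
        J[out] = T - x[out]                       # error signal from the current state
        errors.append(0.5 * J[out] ** 2)
        # All three right-hand sides are evaluated from the same state.
        dx = [(-x[i] + sum(w[i][j] * fx[j] for j in range(n)) + I[i]) / tau_x
              for i in range(n)]
        dy = [(-y[i] + d[i] * sum(w[j][i] * y[j] for j in range(n)) + J[i]) / tau_y
              for i in range(n)]
        dw = [[y[i] * fx[j] / tau_w for j in range(n)] for i in range(n)]
        x = [x[i] + dt * dx[i] for i in range(n)]
        y = [y[i] + dt * dy[i] for i in range(n)]
        for i in range(n):
            for j in range(n):
                w[i][j] += dt * dw[i][j]          # weights drift on the slow time scale
    return errors

w = [[0.0, 0.1, -0.1],
     [0.1, 0.0, 0.1],
     [-0.1, 0.1, 0.0]]
errors = simulate(w, I=[1.0, 0.0, 0.0], T=0.9, out=2)
print("E at start and end:", errors[0], errors[-1])
```

No step in this loop waits for x or y to converge; the separation of time scales alone keeps the outer product close to the true gradient.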
The conceptual advantage of this approach is that one now has a dynamical system which can
be  studied  and  perhaps  used  as  a  basis  for  models  of  actual  physical  or  biological  systems.  This  is 
not  to  say  that  merely  converting  an  algorithm  into  a  dynamical  form  makes  it  biologically  or 
physically  plausible.  It  simply  provides  a  starting  point  for  further  development  and  investigation. 

Intuition  and  formal  results  concerning  algorithmic  models  do  not  necessarily  apply  to  the 
corresponding  dynamical  models.  For  example,  consider  the  well  known  "fact"  that  gradient 
descent  is  a  poor  algorithm  compared  to  conjugate  gradient.  In  fact  this  conventional  wisdom  is 
incorrect  when  it  comes  to  physical  dynamical  systems.  The  reason  is  that  the  disease  which  makes 
gradient  descent  inefficient  is  a  consequence  of  discretization.  The  difficulty  occurs  when 
descending  down  a  long  narrow  valley.  Gradient  descent  can  wind  up  taking  many  tiny  steps 
crossing  and  re-crossing  the  actual  gradient  direction.  This  is  inefficient  because  the  gradient  must 
be  recomputed  for  each  step  and  because  the  amount  of  computation  required  to  recalculate  the 
gradient  from  one  step  to  the  next  is  approximately  constant.  Conjugate  gradient  is  a  technique 
which  assures  that  the  new  direction  is  conjugate  to  the  previous  direction  and  therefore  avoids  the 
problem. Accordingly larger steps may be taken and fewer gradient evaluations are required.
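The zig-zag behaviour is easy to reproduce on a toy problem. The sketch below is an illustration only: the quadratic valley E = (1/2)(x^2 + 100 y^2), the starting point and the two step sizes are assumptions chosen for clarity. With the coarse step the iterate jumps across the valley floor (y = 0) at every step, while with a step nineteen times smaller, covering the same total "time", it never crosses.

```python
def descend(step, n_steps, start=(10.0, 1.0), curvature=100.0):
    """Discrete gradient descent on E = 0.5 * (x**2 + curvature * y**2)."""
    x, y = start
    path = [(x, y)]
    for _ in range(n_steps):
        gx, gy = x, curvature * y            # gradient of E at the current point
        x, y = x - step * gx, y - step * gy
        path.append((x, y))
    return path

def crossings(path):
    """Count sign changes of y, i.e. crossings of the valley floor y = 0."""
    return sum(1 for (_, a), (_, b) in zip(path, path[1:]) if a * b < 0)

coarse = descend(step=0.019, n_steps=50)     # total time 0.95
fine = descend(step=0.001, n_steps=950)      # same total time, much smaller steps
print("crossings with coarse steps:", crossings(coarse))
print("crossings with fine steps:  ", crossings(fine))
```

In the coarse run the steep coordinate is multiplied by 1 - 100 * 0.019 = -0.9 at each step, so it changes sign every time; in the fine run the factor is +0.9 and the approach is monotone.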

On  the  other  hand  gradient  descent  is  quite  satisfactory  in  physical  dynamical  systems  simply 
because  time  is  continuous.  The  "steps"  are  by  definition  infinitely  small  and  the  gradient  is 
evaluated  continuously.  No  repeated  crossing  of  the  gradient  direction  occurs.  For  the  same 
reason,  the  ultimate  performance  of  physical  neural  networks  cannot  be  determined  from  how 
quickly  or  how  slowly  a  "neural"  simulation  runs  on  a  digital  machine.  Instead  one  must  integrate 
the  simultaneous  equations  and  measure  how  long  it  takes  to  learn,  in  multiples  of  the  fundamental 
time scales of the equations. As an example, consider the following illustrative problem. Choose
input and output vectors to be randomly selected 5 digit binary vectors scaled between 0.1 and 0.9.
Use a network with two layers of five units each with connections going in both directions
(50 weights). For dimensionless time scales choose $\tau_x = \tau_y = 1.0$, $\tau_w = 32 \tau_x$ and select a new
pattern at random every $4\tau_x$. The equations may be integrated crudely, e.g. use the Euler method
with $\Delta t = 0.02 \tau_x$. One finds that the error reaches E = 0.1 in approximately $4 \times 10^3 \tau_x$ or
after $10^3$ presentations. Figure 1 shows the error as a function of time.

//INSERT  FIG.  1  HERE// 

To estimate the performance of an electronic physical system we can replace these time scales
with electronic time scales. Therefore, suppose patterns are presented every $10^{-5}$ sec (100 kHz).
This is the performance bottleneck of the system, since the relaxation time of the circuit, $\tau_x$, is then
approximately $2.5 \times 10^{-6}$ sec, which is slow compared with what can be achieved in analog VLSI.
Hence in this case the patterns would be learned in approximately 10 milliseconds.
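The arithmetic behind this estimate can be spelled out in a few lines. The sketch uses only figures stated in the text (a 100 kHz presentation rate, one pattern every four relaxation times, and a total of about 10 milliseconds); the variable names are hypothetical.

```python
pattern_rate_hz = 100e3                          # patterns are presented at 100 kHz
presentation_period = 1.0 / pattern_rate_hz      # seconds between patterns
tau_x = presentation_period / 4.0                # one new pattern every 4 tau_x
learning_time = 10e-3                            # about 10 milliseconds in total

presentations = learning_time / presentation_period
learning_time_in_tau_x = learning_time / tau_x

print("presentation period (s):", presentation_period)
print("tau_x (s):", tau_x)
print("presentations:", presentations)
print("learning time in units of tau_x:", learning_time_in_tau_x)
```

The result is about one thousand presentations, or four thousand relaxation times, in agreement with the dimensionless simulation.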
Unlike simple feedforward networks, recurrent networks exhibit dynamical phenomena. For
example,  a  peculiar  phenomenon  can  occur  if  a  recurrent  network  is  trained  as  an  associative 
memory  to  store  multiple  memories:  it  is  found  that  the  objective  function  can  be  reduced  to  some 
very  small  value,  yet  when  the  network  is  tested  for  recall,  the  supposedly  stored  memory  is 
missing! This is due to a fundamental limitation of gradient descent. Gradient descent is capable of
moving  existing  fixed  points  only.  It  cannot  create  new  fixed  points.  To  create  new  fixed  points 
requires  a  technique  whereby  some  degrees  of  freedom  in  the  network  are  clamped  during  the 
loading  phase  and  released  during  recall  phase.  The  analogous  technique  in  feed-forward  networks 
is  called  teacher  forcing.  It  can  be  shown  that  this  technique  causes  the  creation  of  new  fixed 
points.  Unfortunately,  after  the  suppressed  degrees  of  freedom  are  released,  there  is  no  guarantee 
that  the  system  is  stable  with  respect  to  the  suppressed  degrees  of  freedom.  Therefore  the  fixed 
points  sometimes  turn  out  to  be  repellors  instead  of  attractors.  In  feed-forward  nets  teacher  forcing 
causes  no  such  difficulties  because  there  is  no  dynamics  in  feed-forward  networks  and  hence  no 
attractors  or  repellors. 

Recent  Developments 

Zak (1988) has suggested the use of fixed points with infinite stability in recurrent networks.
These fixed points, denoted "terminal attractors", have two properties which follow from their
infinite stability. First, their stability is always guaranteed, hence the repellor problem never
occurs, and second, trajectories converge onto them in a finite amount of time, rather than an infinite
amount of time. In particular, if a terminal attractor is used in the weight update equation, a
remarkable speedup in learning time occurs; see e.g. Barhen et al. (1989a). These interesting properties
are a consequence of the fact that the attractors violate the Lipschitz condition.
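The finite-time convergence can be seen in a one-dimensional toy comparison. The sketch below is illustrative and not taken from the cited work: it assumes the scalar equation dx/dt = -x^(1/3) as a stand-in for a terminal attractor at x = 0 (its right-hand side has unbounded slope there, so the Lipschitz condition fails), and compares it with the ordinary attractor dx/dt = -x. For x(0) = 1 the first equation has the exact solution x(t) = (1 - 2t/3)^(3/2), which reaches zero at t = 1.5.

```python
def time_to_reach_zero(rhs, x0=1.0, dt=1e-4, t_max=5.0):
    """Euler-integrate dx/dt = rhs(x) from x0; return the time at which x
    first reaches zero, or None if that does not happen before t_max."""
    x, t = x0, 0.0
    while t < t_max:
        x_new = x + dt * rhs(x)
        t += dt
        if x_new <= 0.0:
            return t
        x = x_new
    return None

def linear(x):
    return -x                    # ordinary attractor: exponential approach

def terminal(x):
    return -x ** (1.0 / 3.0)     # assumed terminal attractor, non-Lipschitz at 0

t_linear = time_to_reach_zero(linear)
t_terminal = time_to_reach_zero(terminal)
print("ordinary attractor reaches zero at:", t_linear)
print("terminal attractor reaches zero at:", t_terminal)
```

The ordinary attractor only approaches the fixed point asymptotically, so no arrival time is found, whereas the terminal attractor arrives close to the analytic time of 1.5.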

Pearlmutter (1989) has extended the recurrent formalism to include time-dependent trajectories
(time-dependent recurrent back-propagation or TDRBP). In this approach the objective function of
the fixed point is replaced with an objective functional of the trajectory. The technique is the
continuous time generalization of the sequence generating network discussed by Rumelhart et al.
(1986). As with all back-propagation algorithms, the amount of calculation scales like O(N) for each
gradient evaluation. However, like the Rumelhart network, it requires that the network be unfolded
in time during training. Hence the storage during training scales like O(mN) where m is the
number of unfolded time steps. Furthermore, the technique is acausal in that the back-propagation
equation must be integrated backwards in time. This merely reflects the fact that one is solving a
two-point boundary value problem of the kind familiar from control theory. For problems where the
target trajectories are known a priori and on-line learning is not required, this is the technique of
choice. 
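
The unfold-and-sweep-backwards structure can be sketched for a small network. The code below is a hedged illustration, not Pearlmutter's formulation: it assumes the hypothetical dynamics dy/dt = -y + tanh(W y), discretizes them with Euler steps of size `dt`, and uses a squared-error functional. The names `forward`, `error` and `gradient_backward` are introduced here for illustration only.

```python
import math

def forward(W, y0, dt, m):
    """Euler-unfold dy/dt = -y + tanh(W y) for m steps, storing every state."""
    n = len(y0)
    ys = [list(y0)]
    for _ in range(m):
        y = ys[-1]
        net = [sum(W[i][j] * y[j] for j in range(n)) for i in range(n)]
        ys.append([y[i] + dt * (-y[i] + math.tanh(net[i])) for i in range(n)])
    return ys  # m + 1 stored states: storage grows linearly with m

def error(ys, targets, dt):
    """Discretized functional E = dt * sum_t 0.5 * |y_t - d_t|^2 for t = 1..m."""
    return dt * sum(0.5 * (a - b) ** 2
                    for y, d in zip(ys[1:], targets) for a, b in zip(y, d))

def gradient_backward(W, y0, targets, dt):
    """Gradient of E with respect to W, sweeping backwards over stored states."""
    n, m = len(y0), len(targets)
    ys = forward(W, y0, dt, m)
    grad = [[0.0] * n for _ in range(n)]
    z = [0.0] * n  # z[k] = dE/dy_{t+1}[k], accumulated from later times
    for t in range(m - 1, -1, -1):
        # add the direct error injected at time t + 1
        z = [z[k] + dt * (ys[t + 1][k] - targets[t][k]) for k in range(n)]
        y = ys[t]
        net = [sum(W[i][j] * y[j] for j in range(n)) for i in range(n)]
        delta = [dt * z[i] * (1.0 - math.tanh(net[i]) ** 2) for i in range(n)]
        for i in range(n):
            for j in range(n):
                grad[i][j] += delta[i] * y[j]
        # propagate the sensitivity one step back in time
        z = [(1.0 - dt) * z[j] + sum(W[i][j] * delta[i] for i in range(n))
             for j in range(n)]
    return grad
```

The backward sweep cannot begin until the whole forward trajectory has been stored, which is the acausal character described above. Each backward step touches every weight once.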

On the other hand, a causal algorithm has been suggested by Williams and Zipser (1989). This
algorithm does not take advantage of the back-propagation tricks and therefore the complexity
scales like O(N^2) for each gradient evaluation. Nevertheless, for small problems where on-line
learning is required, it is the technique of choice.
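
A causal gradient evaluation of this kind can be sketched by carrying the sensitivities of every unit to every weight forwards in time. The code below is a hedged illustration in the spirit of the Williams and Zipser approach, not a transcription of their algorithm: it assumes the hypothetical Euler-discretized dynamics y_{t+1} = y_t + dt * (-y_t + tanh(W y_t)) and a squared-error functional, and the name `gradient_forward` is illustrative.

```python
import math

def gradient_forward(W, y0, targets, dt):
    """Gradient of E = dt * sum_t 0.5 * |y_t - d_t|^2 with respect to W,
    accumulated forwards in time for the Euler-discretized network
    y_{t+1} = y_t + dt * (-y_t + tanh(W y_t))."""
    n = len(y0)
    y = list(y0)
    # p[k][i][j] = d y[k] / d W[i][j]; n**3 numbers carried along with the state
    p = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    grad = [[0.0] * n for _ in range(n)]
    for d in targets:
        net = [sum(W[k][l] * y[l] for l in range(n)) for k in range(n)]
        slope = [1.0 - math.tanh(net[k]) ** 2 for k in range(n)]
        # sensitivity update: n**3 entries, each needing a sum over n units
        p_next = [[[(1.0 - dt) * p[k][i][j] + dt * slope[k] * (
                        sum(W[k][l] * p[l][i][j] for l in range(n))
                        + (y[j] if k == i else 0.0))
                    for j in range(n)] for i in range(n)] for k in range(n)]
        y = [y[k] + dt * (-y[k] + math.tanh(net[k])) for k in range(n)]
        p = p_next
        # the error at the new time step is usable immediately (on-line)
        for i in range(n):
            for j in range(n):
                grad[i][j] += dt * sum((y[k] - d[k]) * p[k][i][j]
                                       for k in range(n))
    return grad
```

No past states are stored, so the gradient is available as the trajectory unfolds. The price is the sensitivity update: for a fully connected network of n units it costs on the order of n^4 operations per step, which is quadratic in the number of weights.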

Both  techniques  seek  to  minimize  a  measure  of  the  error  between  a  target  trajectory  and  an 
actual  trajectory  by  performing  gradient  descent.  Only  the  method  used  for  the  gradient  evaluation 
differs.  Therefore  one  expects  that,  to  the  extent  that  on-line  training  is  not  an  issue  and  to  the 
extent that complexity is not an issue, one could use the two techniques interchangeably to create
networks. 

Both  techniques  can  suffer  from  the  repellor  problem  if  an  attempt  is  made  to  introduce 
multiple attractors. As before, this problem could be solved by introducing a time-dependent
terminal  attractor. 

Discussion 

Biologically and physically plausible adaptive systems should satisfy certain constraints: 1)
they should scale well with connectivity, e.g. linearly; 2) they should require no global
synchronization; 3) they should use low precision components; and 4) they should not impose
unreasonable structural constraints, e.g. symmetric weights or bi-directional signal propagation.
Back-propagation  algorithms  in  general  and  RBP  and  TDRBP  in  particular  can  be  considered  in  the 
light  of  each  of  these  constraints. 

Linear scaling of the gradient calculation in back-propagation algorithms is a consequence of the
local nature of the computation, i.e. that each unit only requires information from the units to which
it is connected. This notion of locality, which arises from the analysis of the numerical algorithm, is
distinct from the notion of spatial locality, which is a constraint imposed by physical space on
physical networks. Spatial locality is how one avoids the O(n^2) growth of wires in networks.
Both  locality  constraints  could  be  satisfied  by  physical  back-propagation  networks. 

Global  synchronization  requires  global  connections,  therefore  it  is  undesirable  if  the  network  is 
to  scale  up.  In  one  sense,  the  problem  of  synchronization  has  been  eliminated  in  RBP  because  there 
is no longer any need for separate forward, backward and update steps. Indeed, equations (1), (5)
and (6) are "integrated" simultaneously by the dynamical system as it evolves. There is another
sense in which synchronization causes difficulties. In physical systems and in massively parallel
digital simulations, time delays and asynchronous updates can give rise to chaotic or exponential
stochastic  behavior  (Barhen,  1989b).  Barhen  et  al.  have  shown  that  this  "emergent  chaos"  can  be 
suppressed  easily  by  the  appropriate  choice  of  dynamical  parameters. 

It is still an open question as to whether back-propagation algorithms require low precision or
high precision components. Formal results suggest that some problems, like parity in single-layer
nets  (Minsky,  1969),  may  lead  to  exponential  growth  of  weights.  In  practice  it  appears  that  16  bits 
of  precision  for  the  weights  and  8  bits  of  precision  for  the  activations  and  error  signals  are  sufficient 
for  many  useful  problems  (Durbin,  1987). 

Structurally,  RBP  and  TDRBP  impose  no  constraints  on  the  weight  matrix.  Furthermore,  in 
RBP networks there appears to be no need to take special measures to ensure the stability of the
network while undergoing training. This would help the biological plausibility of the model were it
not for the requirement that the connections be bi-directional. Bi-directionality is arguably the
biggest plausibility problem with the algorithms based on back-propagation. Biologically, this
requires  bi-directional  synapses  or  separate,  but  equal  and  opposite,  paths  for  error  and  activation 
signals.  There  is  no  evidence  for  either  structure  in  biological  systems.  The  same  difficulties  arise 
in electronic implementations, where engineering solutions to this problem have been developed
(Furman and Abidi, 1988), but one would hope that a better adaptive dynamics would eliminate the
problem  altogether. 